<h1>Strings in Go</h1>

<p>
Like many other programming languages, string is also one important kind of types in Go.
This article will list all the facts of strings.
</p>

<h3>The Internal Structure of String Types</h3>

<div>
For the standard Go compiler,
the internal structure of any string type is declared like:
<pre class="line-numbers"><code class="language-go">type _string struct {
	elements *byte // underlying bytes
	len      int   // number of bytes
}
</code></pre>
<p>
From the declaration, we know that a string is actually a <code>byte</code> sequence wrapper.
In fact, we can really view a string as an (element-immutable) byte slice.
</p>
</div>

<p>
Note, in Go, <code>byte</code> is a built-in alias of type <code>uint8</code>.
</p>

<h3>Some Simple Facts About Strings</h3>

<div>

We have learned the following facts about strings from previous articles.
<ul>
<li>
	String values can be used as constants (along with boolean and all kinds of numeric values).
</li>
<li>
	Go supports <a href="basic-types-and-value-literals.html#string-literals">two styles of
	string literals</a>, the double-quote style (or interpreted literals)
	and the back-quote style (or raw string literals).
</li>
<li>
	The zero values of string types are blank strings, which can be represented
	with <code>""</code> or <code>``</code> in literal.
</li>
<li>
	Strings can be concatenated with <code>+</code> and <code>+=</code> operators.
</li>
<li>
	String types are all comparable (by using the <code>==</code> and <code>!=</code> operators).
	And like integer and floating-point values, two values of the same string type can also be compared with
	<code>&gt;</code>, <code>&lt;</code>, <code>&gt;=</code> and <code>&lt;=</code> operators.
	When comparing two strings, their underlying bytes will be compared, one byte by one byte.
	If one string is a prefix of the other one and the other one is longer,
	then the other one will be viewed as the larger one.
</li>
</ul>

Example:
<pre class="line-numbers"><code class="language-go">package main

import "fmt"

func main() {
	const World = "world"
	var hello = "hello"

	// Concatenate strings.
	var helloWorld = hello + " " + World
	helloWorld += "!"
	fmt.Println(helloWorld) // hello world!

	// Compare strings.
	fmt.Println(hello == "hello")   // true
	fmt.Println(hello > helloWorld) // false
}
</code></pre>

<p>
</p>

More facts about string types and values in Go.
<ul>
<li>
	Like Java, the contents (underlying bytes) of string values are immutable.
	The lengths of string values also can't be modified separately.
	An addressable string value can only be overwritten as a whole by assigning another string value to it.
</li>
<li>
	The built-in <code>string</code> type has no methods (just like most other built-in types in Go), but we can
	<ul>
	<li>
		use functions provided in the <a href="https://golang.org/pkg/strings/"><code>strings</code> standard package</a>
		to do all kinds of string manipulations.
	</li>
	<li>
		call the built-in <code>len</code> function to get the length of a string (number of bytes stored in the string).
	</li>
	<li>
		use the element access syntax <code>aString[i]</code> introduced in
		<a href="container.html#element-access">container element accesses</a>
		to get the <b>i</b><i>th</i> <code>byte</code> value stored in <code>aString</code>.
		The expression <code>aString[i]</code> is not addressable.
		In other words, value <code>aString[i]</code> can't be modified.
	</li>
	<li>
		use <a href="container.html#subslice">the subslice syntax</a> <code>aString[start:end]</code>
		to get a substring of <code>aString</code>.
		Here, <code>start</code> and <code>end</code> are both indexes of bytes stored in <code>aString</code>.
	</li>
	</ul>
</li>
<li>
	For the standard Go compiler,
	the destination string variable and source string value in a string assignment will
	share the same underlying byte sequence in memory.
	The result of a substring expression <code>aString[start:end]</code>
	also shares the same underlying byte sequence with the base string
	<code>aString</code> in memory.
</li>
</ul>

<p>
Note, if <code>aString</code> and the indexes in expressions <code>aString[i]</code> and <code>aString[start:end]</code> are all constants,
then out-of-range constant indexes will make compilations fail.
And please note that the evaluation results of such expressions are always non-constants.
</p>

Example:
<pre class="line-numbers"><code class="language-go">package main

import (
	"fmt"
	"strings"
)

func main() {
	var helloWorld = "hello world!"

	var hello = helloWorld[:5] // substring
	// 104 is the ASCII code (and Unicode) of char 'h'.
	fmt.Println(hello[0])         // 104
	fmt.Printf("%T \n", hello[0]) // uint8

	// hello[0] is unaddressable and immutable,
	// so the following two lines fail to compile.
	/*
	hello[0] = 'H'         // error
	fmt.Println(&hello[0]) // error
	*/

	// The next statement prints: 5 12 true
	fmt.Println(len(hello), len(helloWorld),
			strings.HasPrefix(helloWorld, hello))
}
</code></pre>
</div>

<h3>String Encoding and Unicode Code Points</h3>

<p>
Unicode standard specifies a unique value for each character in all kinds of human languages.
But the basic unit in Unicode is not character, it is code point instead.
For most code points, each of them corresponds to a character,
but for a few characters, each of them consists of several code points.
</p>

<p>
Code points are represented as <a href="basic-types-and-value-literals.html#rune">rune values</a> in Go.
In Go, <code>rune</code> is a built-in alias of type <code>int32</code>.
</p>

<p>
In applications, there are several encoding methods to represent code points, such as UTF-8 encoding and UTF-16 encoding.
Nowadays, the most popularly used encoding method is UTF-8 encoding.
In Go, all string constants are viewed as UTF-8 encoded.
At compile time, illegal UTF-8 encoded string constants will make compilation fail.
However, at run time, Go runtime can't prevent some strings from being illegally UTF-8 encoded.
</p>

<p>
For UTF-8 encoding, each code point value may be stored as one or more bytes (up to four bytes).
For example, each English code point (which corresponds to one English character) is stored as one byte,
however each Chinese code point (which corresponds to one Chinese character) is stored as three bytes.
</p>

<a class="anchor" id="conversions"></a>
<h3>String Related Conversions</h3>

<div>
<p>
In the article <a href="constants-and-variables.html#explicit-conversion">constants and variables</a>,
we have learned that integers can be explicitly converted to strings (but not vice versa).
</p>

Here introduces two more string related conversions rules in Go:
<ol>
<li>
	a string value can be explicitly converted to a byte slice, and vice versa.
	A byte slice is a slice whose underlying type is <code>[]byte</code>
	(a.k.a., <code>[]uint8</code>).
</li>
<li>
	a string value can be explicitly converted to a rune slice, and vice versa.
	A rune slice is a slice whose underlying type is <code>[]rune</code>
	(a.k.a., <code>[]int32</code>).
</li>
</ol>

<p>
In a conversion from a rune slice to string,
each slice element (a rune value) will be UTF-8 encoded as from one to four bytes
and stored in the result string.
If a slice rune element value is outside the range of valid Unicode code points,
then it will be viewed as <code>0xFFFD</code>, the code point for the Unicode replacement character.
<code>0xFFFD</code> will be UTF-8 encoded as three bytes (<code>0xef 0xbf 0xbd</code>).
</p>

<p>
When a string is converted to a rune slice, the bytes stored in the string
will be viewed as successive UTF-8 encoding byte sequence representations
of many Unicode code points. Bad UTF-8 encoding representations will be
converted to a rune value <code>0xFFFD</code>.
</p>

<p>
When a string is converted to a byte slice,
the result byte slice is just a deep copy of
the underlying byte sequence of the string.
When a byte slice is converted to a string,
the underlying byte sequence of the result string
is also just a deep copy of the byte slice.
A memory allocation is needed to store the deep copy
in each of such conversions.
The reason why a deep copy is essential is slice elements are mutable
but the bytes stored in strings are immutable,
so a byte slice and a string can't share byte elements.
</p>

Please note, for conversions between strings and byte slices,
<ul>
<li>
	illegal UTF-8 encoded bytes are allowed and will keep unchanged.
</li>
<li>
	the standard Go compiler makes some optimizations
	for some special cases of such conversions,
	so that the deep copies are not made.
	Such cases will be introduced below.
</li>
</ul>

Conversions between byte slices and rune slices are not supported directly in Go,
We can use the following ways to achieve this goal:
<ul>
<li>
	use string values as a hop. This way is convenient but not very efficient, for two deep copies are needed in the process.
</li>
<li>
	use the functions in <a href="https://golang.org/pkg/unicode/utf8/">unicode/utf8</a> standard package.
</li>
<li>
	use <a href="https://golang.org/pkg/bytes/#Runes">the
	<code>Runes</code> function in the bytes standard package</a>
	to convert a <code>[]byte</code> value to a <code>[]rune</code> value.
	There is not a function in this package to convert a rune slice to byte slice.
</li>
</ul>

Example:
<pre class="line-numbers"><code class="language-go">package main

import (
	"bytes"
	"unicode/utf8"
)

func Runes2Bytes(rs []rune) []byte {
	n := 0
	for _, r := range rs {
		n += utf8.RuneLen(r)
	}
	n, bs := 0, make([]byte, n)
	for _, r := range rs {
		n += utf8.EncodeRune(bs[n:], r)
	}
	return bs
}

func main() {
	s := "Color Infection is a fun game."
	bs := []byte(s) // string -> []byte
	s = string(bs)  // []byte -> string
	rs := []rune(s) // string -> []rune
	s = string(rs)  // []rune -> string
	rs = bytes.Runes(bs) // []byte -> []rune
	bs = Runes2Bytes(rs) // []rune -> []byte
}
</code></pre>

<p>
</p>

<!--
https://github.com/golang/go/issues/23536
todo: if later the meaning of "byte slice" is changed.
      The beginning of this article may be also
      need changed. "a <code>byte</code> sequence" should be changed
      to "a byte sequence".

Maybe it is not intentional, for the standard Go compiler
(at least up to version 1.11),
it looks a string can also be converted to a slice type whose
element type's underlying type is <code>byte</code>,
but not vice versa.

<pre class="line-numbers"><code class="language-go">package main

func main() {
	type MyByte byte
	str := "hello"
	bs := []MyByte(str) // compile okay
	str = string(bs)    // error: cannot use []MyByte as []byte
}
</code></pre>
-->

<p>
</p>
</div>

<a class="anchor" id="conversion-optimizations"></a>
<h3>Compiler Optimizations for Conversions Between Strings and Byte Slices</h3>

<div>
Above has mentioned that the underlying bytes in the conversions
between strings and byte slices will be copied.
The standard Go compiler makes some optimizations,
which are proven to still work in Go SDK 1.13,
for some special scenarios to avoid the duplicate copies.
These scenarios include:
<ul>
<li>
	a conversion (from string to byte slice) which
	follows the <code>range</code> keyword in a <code>for-range</code> loop.
</li>
<li>
	a conversion (from byte slice to string) which is used as a map key in map element indexing syntax.
</li>
<li>
	a conversion (from byte slice to string) which is used in a comparison.
</li>
<li>
	a conversion (from byte slice to string) which is used in
	a string concatenation, and at least one of concatenated
	string values is a non-blank string constant.
</li>
</ul>

Example:
<pre class="line-numbers"><code class="language-go">package main

import "fmt"

func main() {
	var str = "world"
	// Here, the []byte(str) conversion will
	// not copy the underlying bytes of str.
	for i, b := range []byte(str) {
		fmt.Println(i, ":", b)
	}

	key := []byte{'k', 'e', 'y'}
	m := map[string]string{}
	// Here, the string(key) conversion will not copy
	// the bytes in key. The optimization will be still
	// made, even if key is a package-level variable.
	m[string(key)] = "value"
	fmt.Println(m[string(key)]) // value
}
</code></pre>

Another example:
<pre class="line-numbers"><code class="language-go">package main

import "fmt"
import "testing"

var s string
var x = []byte{1024: 'x'}
var y = []byte{1024: 'y'}

func fc() {
	// None of the below 4 conversions will
	// copy the underlying bytes of x and y.
	// Surely, the underlying bytes of x and y will
	// be still copied in the string concatenation.
	if string(x) != string(y) {
		s = (" " + string(x) + string(y))[1:]
	}
}

func fd() {
	// Only the two conversions in the comparison
	// will not copy the underlying bytes of x and y.
	if string(x) != string(y) {
		// Please note the difference between the
		// following concatenation and the one in fc.
		s = string(x) + string(y)
	}
}

func main() {
	fmt.Println(testing.AllocsPerRun(1, fc)) // 1
	fmt.Println(testing.AllocsPerRun(1, fd)) // 3
}
</code></pre>
</div>

<h3><code>for-range</code> on Strings</h3>

<div>
<p>
The <code>for-range</code> loop control flow applies to strings.
But please note, <code>for-range</code> will iterate the Unicode code points
(as <code>rune</code> values), instead of bytes, in a string.
Bad UTF-8 encoding representations in the string will be
interpreted as <code>rune</code> value <code>0xFFFD</code>.
</p>

Example:
<pre class="line-numbers"><code class="language-go">package main

import "fmt"

func main() {
	s := "éक्षिaπ囧"
	for i, rn := range s {
		fmt.Printf("%2v: 0x%x %v \n", i, rn, string(rn))
	}
	fmt.Println(len(s))
}
</code></pre>

The output of the above program:
<pre class="output"><code> 0: 0x65 e
 1: 0x301 ́
 3: 0x915 क
 6: 0x94d ्
 9: 0x937 ष
12: 0x93f ि
15: 0x61 a
16: 0x3c0 π
18: 0x56e7 囧
21
</code></pre>

From the output result, we can find that
<ol>
<li>the iteration index value may be not continuous. The reason is the index is the byte index in the ranged string and one code point may need more than one byte to represent.</li>
<li>the first character, <code>é</code>, is composed of two runes (3 bytes total)</li>
<li>the second character, <code>क्षि</code>, is composed of four runes (12 bytes total).</li>
<li>the English character, <code>a</code>, is composed of one rune (1 byte).</li>
<li>the character, <code>π</code>, is composed of one rune (2 bytes).</li>
<li>the Chinese character, <code>囧</code>, is composed of one rune (3 bytes).</li>
</ol>

Then how to iterate bytes in a string? Do this:
<pre class="line-numbers"><code class="language-go">package main

import "fmt"

func main() {
	s := "éक्षिaπ囧"
	for i := 0; i < len(s); i++ {
		fmt.Printf("The byte at index %v: 0x%x \n", i, s[i])
	}
}
</code></pre>
<p>
</p>

Surely, we can also make use of the compiler optimization mentioned above to iterate bytes in a string.
For the standard Go compiler, this way is a little more efficient than the above one.
<pre class="line-numbers"><code class="language-go">package main

import "fmt"

func main() {
	s := "éक्षिaπ囧"
	// Here, the underlying bytes of s are not copied.
	for i, b := range []byte(s) {
		fmt.Printf("The byte at index %v: 0x%x \n", i, b)
	}
}
</code></pre>
<p>
</p>

<p>
From the above several examples, we know that
<code>len(s)</code> will return the number of bytes in string <code>s</code>.
The time complexity of <code>len(s)</code> is <code><i>O</i>(1)</code>.
How to get the number of runes in a string?
Using a <code>for-range</code> loop to iterate and count all runes is a way, and using the
<a href="https://golang.org/pkg/unicode/utf8/#RuneCountInString">RuneCountInString</a>
function in the <code>unicode/utf8</code> standard package is another way.
The efficiencies of the two ways are almost the same.
The third way is to use <code>len([]rune(s))</code> to get the count of runes in string <code>s</code>.
Since Go SDK 1.11, the standard Go compiler make an optimization for the third way
to avoid an unnecessary deep copy so that it is as efficient as the former two ways.
Please note that the time complexities of these ways are all <code><i>O</i>(n)</code>.
</p>

</div>

<a class="anchor" id="string-concatenation"></a>
<h3>More String Concatenation Methods</h3>

<div>
Besides using the <code>+</code> operator to concatenate strings, we can also use following ways to concatenate strings.
<ul>
<li>
	The <code>Sprintf</code>/<code>Sprint</code>/<code>Sprintln</code> functions
	in the <code>fmt</code> standard package
	can be used to concatenate values of any types, including string types.
</li>
<li>
	Use the <code>Join</code> function in the <code>strings</code> standard package.
</li>
<li>
	The <code>Buffer</code> type in the <code>bytes</code> standard package
	(or the built-in <code>copy</code> function) can be used to build byte slices,
	which afterwards can be converted to string values.
</li>
<li>
	Since Go 1.10, the <code>Builder</code> type in the <code>strings</code> standard package
	can be used to build strings. Comparing with <code>bytes.Buffer</code> way,
	this way avoids making an unnecessary duplicated copy of underlying bytes for the result string.
</li>
</ul>

<p>
The standard Go compiler makes optimizations for string concatenations by using the <code>+</code> operator.
So generally, using <code>+</code> operator to concatenate strings is convenient and efficient
if the number of the concatenated strings is known at compile time.
</p>
</div>

<a class="anchor" id="use-string-as-byte-slice"></a>
<h3>Sugar: Use Strings as Byte Slices</h3>

<div>

<p>
From the article <a href="container.html">arrays, slices and maps</a>,
we have learned that we can use the built-in <code>copy</code> and
<code>append</code> functions to copy and append slice elements.
In fact, as a special case,
if the first argument of a call to either of the two
functions is a byte slice, then the second argument can be a string
(if the call is an <code>append</code> call, then the string argument
must be followed by three dots <code>...</code>).
In other words, a string can be used as a byte slice for the special case.
</p>

Example:
<pre class="line-numbers"><code class="language-go">package main

import "fmt"

func main() {
	hello := []byte("Hello ")
	world := "world!"

	// The normal way:
	// helloWorld := append(hello, []byte(world)...)
	helloWorld := append(hello, world...) // sugar way
	fmt.Println(string(helloWorld))

	helloWorld2 := make([]byte, len(hello) + len(world))
	copy(helloWorld2, hello)
	// The normal way:
	// copy(helloWorld2[len(hello):], []byte(world))
	copy(helloWorld2[len(hello):], world) // sugar way
	fmt.Println(string(helloWorld2))
}
</code></pre>
</div>

<a class="anchor" id="comparison"></a>
<h3>More About String Comparisons</h3>

<div>
Above has mentioned that comparing two strings is comparing their underlying bytes actually.
Generally, Go compilers will made the following optimizations for string comparisons.
<ul>
<li>
	For <code>==</code> and <code>!=</code> comparisons,
	if the lengths of the compared two strings are not equal,
	then the two strings must be also not equal (no needs to compare their bytes).
</li>
<li>
	If their underlying byte sequence pointers of
	the compared two strings are equal,
	then the comparison result is the same as
	comparing the lengths of the two strings.
</li>
</ul>

<p>
So for two equal strings, the time complexity of comparing them
depends on whether or not their underlying byte sequence pointers are equal.
If the two equal string values don't share the same underlying bytes,
then the time complexity of comparing the two values is <code><i>O</i>(n)</code>,
where <code>n</code> is the length of the two strings,
otherwise, the time complexity is <code><i>O</i>(1)</code>.
</p>

<p>
As above mentioned, for the standard Go compiler, in a string value assignment,
the destination string value and the source string value will
share the same underlying byte sequence in memory.
So the cost of comparing the two strings becomes very small.
</p>

Example:
<pre class="line-numbers"><code class="language-go">package main

import (
	"fmt"
	"time"
)

func main() {
	bs := make([]byte, 1<<26)
	s0 := string(bs)
	s1 := string(bs)
	s2 := s1

	// s0, s1 and s2 are three equal strings.
	// The underlying bytes of s0 is a copy of bs.
	// The underlying bytes of s1 is also a copy of bs.
	// The underlying bytes of s0 and s1 are two
	// different copies of bs.
	// s2 shares the same underlying bytes with s1.

	startTime := time.Now()
	_ = s0 == s1
	duration := time.Now().Sub(startTime)
	fmt.Println("duration for (s0 == s1):", duration)

	startTime = time.Now()
	_ = s1 == s2
	duration = time.Now().Sub(startTime)
	fmt.Println("duration for (s1 == s2):", duration)
}
</code></pre>

Output:
<pre class="output"><code>duration for (s0 == s1): 10.462075ms
duration for (s1 == s2): 136ns
</code></pre>

<p>
1ms is 1000000ns! So please try to avoid comparing two long strings
if they don't share the same underlying byte sequence.
</p>
</div>


